🐛 Do not update MS status when unable to get workload cluster or machine node #10436
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

Hi @jessehu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area machineset

/ok-to-test
```diff
@@ -883,8 +888,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster
 		node, err := r.getMachineNode(ctx, cluster, machine)
 		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-			log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
```
I think this doesn't address:

> For MachineSet specifically: it definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes, the ready and available replicas were dropping to 0 when the communication broke down. It is probably a matter of preference whether, in case of a communication breakdown, we keep ready and available replicas as-is or drop them to 0. But another side effect is that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Specifically:

> "unhappy path" worse by freezing the status indefinitely

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions)
Thanks @sbueringer.

> "unhappy path" worse by freezing the status indefinitely

This PR does not update any fields of MS.Status when it is unable to get the workload cluster or a Machine's Node, whether due to ErrClusterLocked or any other error. The ErrClusterLocked error recovers soon after reconciling again, and an error preventing the Machine's Node from being fetched (e.g. a network issue, or the apiserver being temporarily unavailable) should also recover soon after reconciling again. Are there other "unhappy paths" you meant?

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Since this PR returns the error at line 891, it won't update ms.Status at line 904 or newStatus.ObservedGeneration at line 924. This is as expected; see the sketch below.
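For readers following along, here is a simplified sketch of the updateStatus flow being referenced. It is paraphrased from machineset_controller.go, not the exact source: `isNodeReady` is a hypothetical stand-in for the real readiness check, and the line markers correspond to the numbers quoted above.

```go
// Paraphrased sketch of updateStatus in machineset_controller.go; it only
// illustrates the control flow discussed above, not the verbatim source.
func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster, ms *clusterv1.MachineSet, filteredMachines []*clusterv1.Machine) error {
	newStatus := ms.Status.DeepCopy()

	readyReplicasCount := 0
	for _, machine := range filteredMachines {
		node, err := r.getMachineNode(ctx, cluster, machine)
		if err != nil && machine.GetDeletionTimestamp().IsZero() {
			// ~line 891 (this PR): return instead of counting the Machine as
			// not ready; everything below is skipped on error.
			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
		}
		if node != nil && isNodeReady(node) { // isNodeReady: hypothetical stand-in
			readyReplicasCount++
		}
	}
	newStatus.ReadyReplicas = int32(readyReplicasCount)

	// ~lines 904/924: only reached when every Node lookup succeeded, so both
	// ms.Status and its ObservedGeneration are only updated on success.
	newStatus.ObservedGeneration = ms.Generation
	newStatus.DeepCopyInto(&ms.Status)
	return nil
}
```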
> and an error preventing the Machine's Node from being fetched (e.g. a network issue, or the apiserver being temporarily unavailable) should also recover soon after reconciling again. Are there other "unhappy paths" you meant?
Yup, apiserver down for a longer period of time / indefinitely
In case the apiserver goes down for a longer period of time or indefinitely, the KCP controller should catch it and handle it correctly.
KCP won't always be able to recover (if that's what you meant)
Hi @fabriziopandini, as discussed with @sbueringer in the current thread, IMHO this PR provides a simple patch that solves the MS status update issue described in #10195, and there would be no need to introduce the suggested handling code (provided by @sbueringer, and IMHO a little complex):
- I replied with my thoughts in 🐛 Do not update KCP and MS status when unable to get workload cluster #10229 (comment) and also removed the KCP status handling code as you suggested.
- In case the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch it and handle it correctly. In this case updating the MS status won't help.

BTW I was on vacation last week, so sorry for the late reply. Thanks!
I will look into it as soon as possible, but we have already invested a lot of time in trying to find an acceptable way forward.
I tried to express above that it's not acceptable for the MS status to be permanently frozen when a workload cluster stays unreachable.
As I described above:
2) In case the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch it and handle it correctly. In this case updating the MS status won't help.
3) Even if we want to update the MS status in this special case, it would not be good to set the MS ready replicas to 0; instead, IMHO a new status field like UnknownReplicas should be added (but this changes the current API definition and is not a trivial change). A rough shape is sketched below.
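For illustration only, here is a hypothetical shape of such an API change. UnknownReplicas does not exist in the current MachineSetStatus; the field and its comment are invented to make the suggestion concrete.

```go
// Hypothetical sketch only: UnknownReplicas is NOT part of the current
// cluster-api MachineSetStatus. It illustrates reporting "unknown" instead of
// forcing ReadyReplicas to 0 when the workload cluster is unreachable.
type MachineSetStatus struct {
	// Replicas is the most recently observed number of replicas.
	Replicas int32 `json:"replicas"`
	// ReadyReplicas is the number of ready replicas for this MachineSet.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
	// AvailableReplicas is the number of available replicas for this MachineSet.
	AvailableReplicas int32 `json:"availableReplicas,omitempty"`
	// UnknownReplicas would count Machines whose Node status could not be
	// determined, e.g. because the workload cluster apiserver is unreachable.
	UnknownReplicas int32 `json:"unknownReplicas,omitempty"`
}
```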
Sure, updating the MS status won't contribute to KCP trying to remediate the problem, but it will actually keep the status up-to-date. The status is much more than just the replica fields.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What this PR does / why we need it:
The ErrClusterLocked error causes MD.Status.ReadyReplicas to drop from 3 to 0; after about 90s it changes back to 3. The reason is that updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready (see the sketch below). Setting MD.Status.ReadyReplicas from 3 to 0 is unreasonable and causes issues in our project built on top of CAPI.
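For context, a paraphrased sketch of the pre-PR loop (not the exact source; `isNodeReady` is a hypothetical stand-in for the real readiness check) showing why a transient lookup failure zeroed the counters:

```go
// Pre-PR behavior (paraphrased sketch): a Node lookup error is only logged and
// the Machine is skipped, so it is never counted as ready or available.
for _, machine := range filteredMachines {
	node, err := r.getMachineNode(ctx, cluster, machine)
	if err != nil && machine.GetDeletionTimestamp().IsZero() {
		log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
		continue // e.g. on ErrClusterLocked every Machine is skipped -> ReadyReplicas becomes 0
	}
	if node != nil && isNodeReady(node) { // isNodeReady: hypothetical stand-in
		readyReplicasCount++
	}
}
```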
Which issue(s) this PR fixes:
Fixes #10195
Changes
Do not update any fields of MS.Status when unable to get the workload cluster or a Machine's Node, whether due to ErrClusterLocked or any other error. The ErrClusterLocked error recovers soon after reconciling again, and an error preventing the Machine's Node from being fetched (e.g. a network issue, or the apiserver being temporarily unavailable) should also recover soon after reconciling again. A sketch of the assumed caller-side handling follows.
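A minimal caller-side sketch (assumed handling, not code from this PR) of how the wrapped error could be treated as transient so the status simply refreshes on the next reconcile; remote.ErrClusterLocked is the sentinel error exported by cluster-api's controllers/remote package:

```go
// Assumed caller-side handling (sketch, not part of this PR): requeue on
// remote.ErrClusterLocked instead of surfacing it as a reconcile failure.
if err := r.updateStatus(ctx, cluster, machineSet, filteredMachines); err != nil {
	if errors.Is(err, remote.ErrClusterLocked) {
		log.V(5).Info("Requeueing because another worker has the lock on the ClusterCacheTracker")
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	}
	return ctrl.Result{}, errors.Wrap(err, "failed to update MachineSet status")
}
```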
Please refer to more discussion on #10229 (comment).
Test:
@sbueringer @fabriziopandini please kindly take a look again. Thanks for your time!